The identification of abnormal situations in information and telecommunication systems is
considered, based on the analysis of statistical information about network traffic packets. A
method for identifying an anomalous situation based on segmentation of the data sample is
proposed. The method is aimed at using the classifying algorithms that have the best quality
indicators on individual data segments. The proposed method will be useful for monitoring
information security systems. The method registers factors that affect the change in the
properties of target variables. Impact detection makes it possible to generate data samples
depending on current and expected situations. Using the NSL-KDD dataset as an example, the data
were divided into subsets, taking into account the influence of the factors on the range of
values. The processing of factors is shown using a change point detection function on the time
series. With its use, the data sample was divided into a finite number of non-intersecting
measurable subsets. The results for Accuracy, Precision, F-Measure, and Recall are shown for
various classifiers. The proposed method makes it possible to increase the quality indicators of
classification in continuously changing operating conditions of telecommunication systems.

Keywords: Information security; Network traffic; Segmentation
1. INTRODUCTION 

The evolution and widespread use of information and telecommunication systems (ITS) drive the
rapid growth of network traffic, which must be processed and analyzed. To solve these problems,
methods based on clustering, classification, and prediction are used [1]-[5]. Depending on the
characteristics of information systems, network traffic can have various properties determined by
the volume and frequency of service and information messages. The architecture and structure of an
ITS make it possible to divide it into separate components, where in each segment the sequences of
messages and packets have their own properties.

Representing network traffic in the form of models based on discrete states makes it possible to
use machine learning methods to identify destructive influences that may occur in the system [6],
[7]. However, during the operation of an information system, the ranges and distributions of the
input and output data often change over time. Achieving the specified indicators in detecting
destructive influences depends not only on the machine learning methods but also on the properties
of the data in the samples. In this regard, there is a need to adapt machine learning methods to
emerging changes in the range of values of target variables. The presence of data that reproduce
the properties of the general population is in many cases more important than the classifying
algorithm, which determines the need to create representative samples.

The quality indicators achieved by models depend on both the classifying algorithms and the data
properties. An analysis of the properties of observed objects is as important as the quality of the
training methods used. The dependences of the quality indicators of various classification
algorithms are investigated in [8]-[12].

The quality of machine learning models is improved through several approaches. The first is
associated with ensembles of classifying algorithms trained on subsets of the data [13]-[15]. The
essence of such methods is to combine the predictions of models. However, they are not universal
and have difficulties associated with forming a model of classifiers that evaluates reliability.

The second direction is based on the control of the probability distribution. Such methods are
aimed at detecting possible changes in the processed data. They require a large amount of
resources and, in certain situations, do not allow the boundaries of a segment to be determined
accurately and unambiguously [16], [17]. The third direction is the development of models
associated with predictive analysis of data behavior [16], [18], [19]. They are based on prior
knowledge of the features of the concepts that may be contained in the data and of their changes
over time. When there is a large number of analyzed target variables, the resulting models are
complex and computationally expensive.

In most cases, the methods used today are highly specialized and require significant
implementation costs [19]. The problem is that it is difficult to determine in advance which of the
selected methods will provide a solution of the required quality. In this regard, various methods
and their combinations are used, and the decision to select a model depends on the quality it
achieves on the control sample. This article proposes to take into account the factors affecting
the properties of traffic by solving the problem of segmentation of the data sample, and to form a
strategy that assigns a classifier to each segment of the sample.


2. METHOD 
2.1. Formalization of the proposed method 

In machine learning tasks, the main problematic issue is the formation of data samples. In
practice, situations arise when traffic properties change during the functioning of an ITS. For
example, depending on the number of users on the network, the volume of data differs between day
and night hours. A heterogeneous attribute space, which is formed from various messages and their
internal structure, creates separate difficulties for classifying algorithms. The same messages
with different flags indicate the occurrence of different events on the network. At the same time,
as the system functions, the ranges and distributions of the studied variables may change. A data
sample obtained under such conditions does not always representatively reflect the distribution of
events, which can lead to the effect of “scattering” of answers and affect the quality of analysis.

The tuple of values $X = (x_1, \ldots, x_n)$ characterizing network traffic has many parameters.
During the operation of the system, the frequency of both informational and service messages with
various flags may increase at certain points in time. For example, the appearance of a relatively
large number of messages with the SYN flag may indicate a possible attempt to connect to the
network [20], [21]. This, in turn, provides information for deciding whether these attempts are
legitimate. Using quantitative characteristics and a labeled sample based on “historical”
experience, it is possible to determine the normal and anomalous states. Denote by
$C = \{c_1, c_2\}$ the labels of the normal and anomalous states of the ITS.

The quantitative values of the attributes $x_1, \ldots, x_n$ are predictors. Based on the analysis
of their values, it is necessary to assign a specific object as accurately as possible to its
group: the normal state $c_1$ or the anomalous state $c_2$. In this case, the identification of the
ITS state is considered as a machine learning task defined on the compact space $X$ and the labels
$C$, which involves the creation of an algorithm:


$a: X \to C$ (1)


To determine the quality indicators of the classifier $a$ in (1), we define a loss function $L$,
which compares the prediction with the label.

The proposed method can be considered in classification tasks. Consider the error indicator as a
function for measuring the losses of the classification algorithm $a(x_i)$ acting on the sample
$X^p$ (where $p$ is the number of tuples in the sample).

$I(x_i, a) = [c_i \neq a(x_i)]$ (2)

The frequency of error (2) of the algorithm $a(x)$, used to analyze losses, is determined by the
expression:


$L(a, X^p) = \frac{1}{p} \sum_{i=1}^{p} I(x_i, a)$ (3)
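
Expressions (2) and (3) can be illustrated with a minimal sketch. The function name `error_rate`, the toy threshold rule, and the sample values are illustrative assumptions and do not come from the experiments.

```python
from typing import Callable, Sequence

def error_rate(classifier: Callable[[Sequence[float]], int],
               samples: Sequence[Sequence[float]],
               labels: Sequence[int]) -> float:
    """Empirical loss L(a, X^p): fraction of tuples where a(x_i) != c_i."""
    if len(samples) != len(labels) or not samples:
        raise ValueError("samples and labels must be non-empty and aligned")
    # The indicator I(x_i, a) equals 1 when the prediction differs from the label.
    mistakes = sum(1 for x, c in zip(samples, labels) if classifier(x) != c)
    return mistakes / len(samples)

# Hypothetical toy classifier: report an anomaly (1) when the first
# attribute exceeds a threshold, otherwise the normal state (0).
def toy_rule(x):
    return 1 if x[0] > 10 else 0

X = [(3,), (12,), (8,), (15,)]
c = [0, 1, 1, 1]
print(error_rate(toy_rule, X, c))  # one mistake out of four tuples: 0.25
```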


A set of factors $V$ affects the registered data. Some factors can be defined explicitly. For
example, working and non-working hours can have a significant impact on the volume of network
traffic. However, because several factors may act simultaneously, it is not always possible to
interpret them unambiguously. This leads to the need to analyze the data sample with automatic
methods, for example, searching for an abrupt change in the signal or detecting concept drift [22].
The influence of external and internal factors on an ITS makes the data sample heterogeneous.

To increase the quality indicators of machine learning methods, which are affected by outliers,
noise, and changes in the probability density of events, there is a need to divide the set $X^p$
into non-intersecting subsets, given the influence of the factors $v_i \in V$, $i = 1, \ldots, m$.


$X^p = X_1^{p_1} \cup X_2^{p_2} \cup \ldots \cup X_m^{p_m}$


Then it is necessary to minimize the loss function on each subset $X_i^{p_i} \subset X^p$ that is
affected by the factor $v_i$ or a collection of factors.


$L(a_j, X_i^{p_i}) \to \min$ (4)


Applying pre-selected classification algorithms on the basis of expression (4) makes it possible to
determine, for each segment $X_i^{p_i}$, the classifier that has the best value of the loss
function. The selection of the classifier with the best quality indicators on a data subsample is
determined by expression (5), where $A$ is the set of pre-selected algorithms and
$X_i^{p_i} \subset X^p$.


$a_i(x) = \arg\min_{a_j \in A} L(a_j, X_i^{p_i})$ (5)


Losses on the entire sample must be minimized using the classifiers assigned to each segment, where
$a_i$ denotes the classifier assigned to the segment $X_i^{p_i}$.

$\sum_{i=1}^{m} L(a_i, X_i^{p_i}) \to \min$ (6)


Applying expression (6) over the segments of the sample makes it possible to choose a group of
classifiers in which each classifier has the best indicators on the segment assigned to it.
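
One way to see why per-segment selection cannot do worse than a single model on the same data is sketched below. The derivation assumes that the segments are non-intersecting, together cover $X^p$, and contain $p_i$ tuples each, so that $p = \sum_i p_i$. It weights each segment loss by $p_i / p$, which is an added assumption relative to the unweighted sum in (6).

```latex
L(a, X^p)
  = \frac{1}{p} \sum_{i=1}^{m} \sum_{x_k \in X_i^{p_i}} I(x_k, a)
  = \sum_{i=1}^{m} \frac{p_i}{p} \, L(a, X_i^{p_i}),
\qquad
\sum_{i=1}^{m} \frac{p_i}{p} \, \min_{a_j \in A} L(a_j, X_i^{p_i})
  \;\le\; \min_{a_j \in A} L(a_j, X^p).
```

The inequality holds for losses measured on the same segments that were used for selection. On unseen data the gain is not guaranteed and has to be checked experimentally.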

Figure 1 shows an illustration of the suggested method. A set of classifying algorithms is defined. The 
input sequence is divided into separate segments, where the loss function is calculated for each of the analyzed 
classifiers. Depending on the values, each segment is assigned its own model. 
Figure 1. Proposed method illustration 


Segmentation of data when analyzing the state of telecommunication systems (Ilya Lebedev) 


1476 O ISSN: 2502-4752 


Thus, the proposed method is based on segmentation of the sample. On each segment, the quality
indicators of all predefined models are calculated, and the most suitable model is assigned. Unlike
ensemble approaches, the proposed method avoids the effect in which weaker algorithms degrade the
overall result of the model, and it is less resource-intensive. At the same time, easily
interpretable models can be used for each segment.
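
The selection step can be sketched as follows. This is a minimal illustration under stated assumptions: the two fixed threshold rules stand in for pre-trained classifiers, and the names `loss`, `assign_models`, `night`, and `day` are hypothetical.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Sample = Sequence[float]
Classifier = Callable[[Sample], int]
Segment = Tuple[List[Sample], List[int]]  # (tuples, labels)

def loss(clf: Classifier, segment: Segment) -> float:
    """Error frequency of clf on one segment, as in expression (3)."""
    xs, cs = segment
    return sum(clf(x) != c for x, c in zip(xs, cs)) / len(xs)

def assign_models(segments: List[Segment],
                  models: Dict[str, Classifier]) -> List[str]:
    """For each segment, return the name of the model with minimal loss (5)."""
    return [min(models, key=lambda name: loss(models[name], seg))
            for seg in segments]

# Hypothetical stand-ins for pre-trained models: two fixed threshold rules.
models = {
    "low_threshold": lambda x: int(x[0] > 5),
    "high_threshold": lambda x: int(x[0] > 50),
}
# Two toy segments with different value ranges (for example, night and day traffic).
night = ([(2,), (4,), (7,), (9,)], [0, 0, 1, 1])
day = ([(20,), (40,), (60,), (80,)], [0, 0, 1, 1])

chosen = assign_models([night, day], models)
print(chosen)  # ['low_threshold', 'high_threshold']
```

Each rule is error-free on one segment and wrong on half of the other, so no single rule matches the per-segment assignment on this toy data.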


2.2. Implementation of the proposed method 

The implementation of the method involves pre-processing of information and an analysis of
properties that makes it possible to divide incoming sequences into segments in real time. Figure 2
shows an example of the sequence of steps of a continually learning model. The model shown in
Figure 2 has two levels. The lower level processes the continuous information flow. The upper level
implements the procedures for continual learning of the model.


[Figure 2 flowchart. Legible block labels: training sample; properties of segments; grouping of
segments; training of predetermined models on the allocated segments; determination of the best
quality indicators from predetermined models; monitoring; clarification of the sample; information
flow; highlighted segments; choosing the best of predetermined models for the current segment;
result.]


Figure 2. Example of a succession of steps of a continually learning model 


At the beginning, the primary training set $x_1, \ldots, x_n$ of the information sequence is
formed. Based on this set, individual segments with differing data properties are allocated. In
simple cases, for time series, a change in the data properties can be detected by searching for the
moment $\theta$ at which the characteristics of the observed process change (for example, the trend
direction or amplitude):


$x_t = \begin{cases} x_t^{i}, & t \le \theta \\ x_t^{i+1}, & t > \theta \end{cases}$
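
A search for a single moment $\theta$ can be sketched as follows. This is a simple least-squares mean-shift search written for illustration, not the Ruptures-based procedure used in the experiment. The function name `find_change_point` and the toy series are assumptions.

```python
from typing import Sequence

def find_change_point(series: Sequence[float]) -> int:
    """Return theta (1 <= theta < len(series)), the number of observations
    in the first regime. The split series[:theta] / series[theta:] is chosen
    to minimize the total squared deviation from the two segment means."""
    n = len(series)
    if n < 2:
        raise ValueError("need at least two observations")

    def sse(part):
        # Sum of squared deviations of a segment from its own mean.
        mean = sum(part) / len(part)
        return sum((v - mean) ** 2 for v in part)

    best_theta, best_cost = 1, float("inf")
    for theta in range(1, n):
        cost = sse(series[:theta]) + sse(series[theta:])
        if cost < best_cost:
            best_theta, best_cost = theta, cost
    return best_theta

# Toy series: the level jumps from about 10 to about 30 after four points.
volume = [10, 11, 9, 10, 30, 31, 29, 30]
theta = find_change_point(volume)
print(theta, volume[:theta], volume[theta:])
```

The exhaustive search is quadratic in the series length, which is acceptable for a sketch. Detecting several change points or changes in trend requires a more elaborate cost function and search strategy.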


As a result, the original sample is divided into several parts $X_1^{p_1}, \ldots, X_m^{p_m}$.
Their properties are analyzed. If the predefined parameters of some segments match, the number of
segments under consideration can be reduced.

The models $a_1, a_2, \ldots, a_n$ are trained on the subsamples $X_1^{p_1}, \ldots, X_m^{p_m}$,
and the achieved quality indicators are analyzed. On each segment $X_i^{p_i}$, the loss function
$L(a_j, X_i^{p_i})$ is determined for each model $a_j(x)$. Its values make it possible to rank the
models $a_1, a_2, \ldots, a_n \in A$ and to assign to each segment the model that has the highest
quality indicators.

At the lower level, procedures for segmentation and for determining the properties of the data
sequence are also performed on the incoming information flows. The properties of the segments
identified during the processing of the information flow are compared with the properties of the
subsamples obtained from the training sample. This comparison makes it possible to assign one of
the pre-trained models $a_1, a_2, \ldots, a_n \in A$ to the current segment.
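
The matching of an incoming segment to a training subsample can be sketched as follows. Which properties are compared is not specified here, so the sketch assumes a segment is summarized by its mean and standard deviation and matched by squared Euclidean distance. The profile names and the model assigned to each profile are hypothetical.

```python
from statistics import mean, pstdev
from typing import Dict, Sequence, Tuple

def describe(values: Sequence[float]) -> Tuple[float, float]:
    """Summarize a segment by (mean, population standard deviation)."""
    return mean(values), pstdev(values)

def nearest_segment(current: Sequence[float],
                    profiles: Dict[str, Tuple[float, float]]) -> str:
    """Return the key of the stored profile closest to the current segment."""
    m, s = describe(current)
    return min(
        profiles,
        key=lambda k: (profiles[k][0] - m) ** 2 + (profiles[k][1] - s) ** 2,
    )

# Hypothetical profiles of training subsamples and the model assigned to each.
profiles = {
    "night": describe([2, 4, 7, 9]),
    "day": describe([20, 40, 60, 80]),
}
assigned_model = {"night": "OneR", "day": "HT"}

incoming = [25, 35, 55, 70]
segment_id = nearest_segment(incoming, profiles)
print(segment_id, assigned_model[segment_id])  # day HT
```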

At the last stage, the model $a_j(x)$ selected for the current segment is used to solve the flow
processing problems. Comparing the real values with the values obtained by the model makes it
possible to decide which data should be formed to refine the algorithm. These data are subsequently
added to the training sample. Thus, it is possible to implement a continually learning model in
which learning and the processing of information flows are carried out in parallel. When complex
classification or regression models are used, pre-trained models can reduce the time spent on
training when the data properties change.
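
The monitoring step can be sketched as follows. The rule for selecting refinement data is an assumption: misclassified tuples are collected, and a hypothetical threshold `max_error` signals when retraining may be warranted.

```python
from typing import Callable, List, Sequence, Tuple

Sample = Sequence[float]
Labeled = Tuple[Sample, int]

def monitor(model: Callable[[Sample], int],
            stream: List[Labeled],
            training_sample: List[Labeled],
            max_error: float = 0.2) -> bool:
    """Compare predictions with real labels on the current segment.

    Misclassified tuples are appended to training_sample for later
    refinement. Returns True when the error rate exceeds max_error.
    """
    misses = [(x, c) for x, c in stream if model(x) != c]
    training_sample.extend(misses)
    return len(misses) / len(stream) > max_error

# Hypothetical model and labeled stream for the current segment.
training: List[Labeled] = []
current = [((20,), 0), ((45,), 1), ((60,), 1), ((80,), 1)]
needs_refit = monitor(lambda x: int(x[0] > 50), current, training)
print(needs_refit, training)
```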
3. RESULTS AND DISCUSSION 

To evaluate the proposed solution experimentally, the NSL-KDD dataset was used. The NSL-KDD Test
sample contains 22544 records, of which 9711 belong to the normal class and 12833 to anomalous
traffic. The dataset contains more than 40 attributes [23], [24]. The classification algorithms
were trained with standard Weka settings. In the first part of the experiment, segmentation was
carried out using the Ruptures library [25]. The selected change points in the time series are
presented in Figure 3.
Figure 3. Change points of the segmentation, presented in multidimensional form. The horizontal
axis shows time (Time); the vertical axis shows the volume of data (Volume) transmitted from the
source to the destination in one connection


As a result, the network traffic was divided into several segments. Each of these segments has its
own properties associated with the trend and the statistical range of the data. Hoeffding tree (HT)
and OneR classifiers were trained on all the obtained segments, and the values of the quality
indicators were determined for each of them.
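
The four indicators can be computed from the confusion counts as in the sketch below. It assumes that the anomaly class is treated as the positive class (label 1). The label vectors are toy values, not experimental results.

```python
from typing import Dict, Sequence

def quality_indicators(y_true: Sequence[int], y_pred: Sequence[int],
                       positive: int = 1) -> Dict[str, float]:
    """Accuracy, Precision, Recall and F-Measure for a binary task."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = len(pairs) - tp - fp - fn
    # Guard against empty denominators on degenerate segments.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f_measure = (2 * precision * recall / (precision + recall)
                 if precision + recall else 0.0)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f_measure": f_measure,
    }

y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]
scores = quality_indicators(y_true, y_pred)
print(scores)
```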

Given the required quality indicator, expression (5) can be used to choose the classifier that
shows the best values on a segment, i.e., to assign a specific classifier to each segment. Figure 4
(in Appendix) shows the indicators for the entire sample, for the individual segments, and the
average values obtained when the best classifiers are chosen. Analysis of the histograms shows that
segmenting the sample and assigning the classifiers with the best quality indicators makes it
possible to improve the quality of processing of the entire sample.

Separating the sequences makes it possible to mitigate outliers and noise and to form compactly
localized subsets in the object space. With segmentation, the quality indicators can be increased
by about 5% compared to the sample as a whole. However, the properties of the data on which
regression models are trained and tested affect their effectiveness.


4. CONCLUSION 

The article proposes a solution that uses pre-trained, predetermined classifiers. The method is
based on dividing the sample into separate segments with different data properties. An analysis of
information about changes in the range of values and in the balance of events is used to form
training samples and to improve the quality of models.

Separating the data and choosing the models with the best quality indicators makes it possible to
reduce the losses compared to processing the entire sample. The originality of the proposed method
is that the sample is divided into separate segments, each of which has its own properties.
Training the algorithms on these segments in advance makes it possible to choose and assign models
with better quality indicators when the properties of the data flow change.

The main advantage of the proposed method is that it can adapt to the states of heterogeneous
segments of a telecommunications network operating under various conditions. The disadvantage is
the sensitivity of the classifying algorithms to a shift in the responses. To overcome this effect,
the segment samples must be analyzed in advance for the possible occurrence of covariate shift in
the subsets.